This assignment is for ETC5521 Assignment 1 by Team WOMBAT comprising of Hai Hanh Ngo and Dewi Lestari Amaliah.

1 Introduction and motivation

Obtaining data from hotel industry has always been a real challenge. Whilst we can read about the hotel jargons and terms easily, scanned through some of the common descriptive statistics of the hotel industry in market research reports or seems to stand on top of the world by pocketing some of the “industry secret” provided by the Internet for price negotiating for your next summer trip, less is known about how a hotel actually runs behind the scence. This sad truth was foreseable as no hotels would be willing to share their performance to the public eyes. Such limitation posed a challenge for researchers, scientists, hotelliers and many others to perform any studies relating to this industry.

The “hotel booking demand datasets” compiled by Nuno Antonio, Ana de Almeida and Luis Nunes (Antonio, Almeida, and Nunes 2019) was a beautiful effort to overcome such challenge. This dataset was obtained from two hotels in Portugal - one city hotel in Lisbon and one resort in Algrave. Some of the sensitive information that could reveal the identity of the two hotels was not provided, but it did not affect the important role of this dataset for the purpose of education, management, machine learning and many others.

In this study, we performed some initial analysis on the dataset to help us answer one of the question we mentioned at the beginning, about how hotels performance actually looks like behind the scence. What can we learn about the way hotels manage their customers’info, how they priced our bookings or who are more likely to cancel their bookings.

We will mainly use R programming software for this analysis along with other research materials we could find on the internet. Graphical displays are mostly utilised to support our diagnostics and analysis.

2 Data description

2.1 Dateset overview and structure

In 2018, Nuno Antonio, Ana de Almeida and Luis Nunes published a data article called “Hotel booking demand datasets” which introduced to the public two datasets for two hotels in Portugal, one resort in Algarve (H1) and one city hotel in Lisbon (H2). These datasets were originally collected to serve the development of prediction models to assess the hotels’ likelihood of having their bookings cancelled, however the uses of variables included also allow other researching purposes.

The datasets for the two hotels can be downloaded separately at https://www.sciencedirect.com/science/article/pii/S2352340918315191. However, jthomasmock at tidytuesday challenge had done us a favor and combined the two datasets into one. The two original datasets and the combined one can be obtained at https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md. For this study, we used only the combined dataset.

The dataset is stored in a csv file format with 32 variables and 119,390 observations (40,060 for H1 and 79,330 for H2). Each observation represents one hotel booking. The format of the dataset is as below:

## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resor…
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75,…
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 20…
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July"…
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27…
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4,…
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "B…
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GB…
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corpora…
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corpora…
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C"…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C"…
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposi…
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", …
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL"…
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_type                  <chr> "Transient", "Transient", "Transient",…
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00…
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3,…
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out",…
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 2…

Knowing this dataset belonged to a hotel, we can make sense most of the variables. However, not all of them are familiar for anyone who did not have a background of hotel management. We will run through some of the industry jargons and variables’meaning before we take a further look at the data.

Variables note:

  • is_cancelled: (1) if the booking was cancelled and (0) if not.
  • lead_time: number of days between when the booking was entered into the hotel’s booking system and the |arrival date.
  • meal: Type of meal booked which can be:
    • BB: Bed and Breakfast.
    • FB: Full board (breakfast, lunch and dinner).
    • HB: Half board (breakfast and one meal, usually dinner).
    • Undefined/ SC: no meal package.
  • country: Guests’country of origin.
  • market_segment: guests’market segment, some of which may associate with booking channel.
    • Direct: guests that make bookings directly with the hotels, could be from hotel’s website/ phone booking or walk-ins.
    • Corporate: Guests whom bookings are made by corporate/ company or guests who are business travellers.
    • Online TA: Online travel agents - bookings that made through a third party websites. Examples are Agoda, Expedia, Booking.com…
    • Offline TA/TO: Bookings made by Travel agents or Tour operators.
    • Complementary: Free stays offered for guests, usually from hotels’ promotional programs.
    • Groups: guests who travelled in groups.
    • Undefined: Undefined type of guests.
    • Aviation: We are not entirely sure but this could be airline crews.
  • Distribution_channel:
    • Direct: bookings that made directly with the hotels (hotel’s websites, phone or walk-ins)
    • Corporate: bookings made by corporate/company.
    • TA/TO: Travel agents/ tour operators/
    • Undefined: Undefined distribution channel.
    • GDS: Global Distribution System. GDS served like a hub for companies in the travel industry (airlines, hotels, car rental…) to connect with travel agents. Hotels will put some of their inventories (rooms) to the GDS and travel agents then can sell those rooms to their customers. Some of the well-known GDS include Amadeus, Sabre and Galileo.
  • is_repeated_guess: (1) if repeated and (0) if not.
  • customer_type:
    • Transient: Individuals or groups that occupy less than 10 rooms per night. These guests usually stay in the hotel short - term and require little services.
    • Contract: bookings bound by contracts, usually for more than 30 days for a consistent block of rooms.
    • Transient - Party: Transient booking but associated to other transient booking.
    • Group: bookings associated to a group, usually occupy more than 10 rooms per night.
  • previous_cancellations: number of previous cancellations prior to current booking by a customer.
  • previous_bookings_not_canceled: number of previous non - cancelled bookings prior to current booking by a customer.
  • booking_changes: number of changes made to the booking from when it was enterred into the system till the day of arrival/ cancellation.
  • agent: ID of travel agency that made the bookings.
  • company: ID of companies that made the bookings.
  • days_in_waiting_list: number of days booking was in the waiting list before it was confirmed to customers.
  • adr: Average Daily Rate, computed by taking total room revenue (excluded breakfast, tax and service charges) divided by total number of room nights sold.

You may notice that some variables contained the “NULL” values (eg: agent or company variable). This “NULL” value did not mean the value was missing, rather such value did not exist to begin with; for example a booking may not have the ID of an agent or a company associated with it as such booking was made by an individual traveller.

2.2 Limitation

  • This dataset contained data for two specific hotels in Portugal. Such limitation in study objects introduced challenges when we attempted to explain the trends observed as reasons could be hotel- specific and we could not use industry knowledge to cover.
  • Some information of the hotels were not provided, for example the number of rooms, the occupancy rate, the location of the hotels (in the busy district or at the city suburban), years of operation or special events that might have occurred. The lack of information might render some of our questions unanswered.
  • Eventhough we were provided with the collection method, we were not be able to verify the validity and correctness of the data. We noticed during our analysis that some of the entries were not sensible and could very likely due to the input errors. However we were unable to verify such concern.
  • Since the observation began in July 2015 and ended in August 2017, we only have a fully cover data by year in 2016. Moreover, the coverage of the dataset in 2015 and 2017 are only six months and eight months, respectively. Hence, we would not analyze the data in year-wise manner because it might be not apple to apple to be compared.

2.3 Time Frame of Collection

The time frame covered in this data set is from 1st July of 2015 to 31st August of 2017, daily.

2.4 Collection methods

Antonio, Almeida, and Nunes (2019) collected the data by extracting the variables from the hotels’ PMS (Property Management System) databases’ server with a TSQL query in SQL Server Studio Manager. The tables that were used to extract the variables are:

  1. BO (booking table in which the key, which is the ID, was retrieved).
  2. BL (bookings change log, in this case, if the booking details with respect to the day before arrival changed, the value used was the one present in this table).
  3. ML (meals).
  4. DC (distribution channel).
  5. TR (transaction).
  6. CP (customer profiles).
  7. NT (nationalities).
  8. MS (market segments).

A diagram below made by Antonio, Almeida, and Nunes (2019) presented the structure of the PMS databases:

PMS database diagrams

Figure 2.1: PMS database diagrams

2.5 Data Cleaning

2.5.1 Missing values checking

Bennett (2001) argued that it is important to take missing value into account, otherwise the statistical analysis will be misleading and variability of the data could not be estimated correctly. Thus, before analyzing the data, we checked the missing value using visdat package (Tierney 2017) first.

Variables Type and Missing Value Visualization

Figure 2.2: Variables Type and Missing Value Visualization

Figure 2.2 shows that there is no missing value in the dataset. It is inline with what Antonio, Almeida, and Nunes (2019) stated that there is no missing values in the database table. However, some “NULL” values were presented which should be interpreted as “not applicable”, not a missing value (Antonio, Almeida, and Nunes 2019). For example, if the the company value is NULL, it means that the booking was not made by a company.

In children variable, when we checked the missing value using naniar::where_na() function by Tierney et al. (2020), we found that there are 4 observations that are actually missing. Figure 2.2 may did not capture it because the proportion is too low (4 out of 119390 observations). We then treated these missing values by imputing it with the average of children (mean imputation) (Kang 2013) and created new variable called imputed_children.

2.5.2 Data Transformation

Figure 2.2 also conveys the type of the variables in the dataset. We could see that a lot of variables (such as hotel, distribution_channel, and is_canceled) have incorrect type. Those variables should be a factor/categorical instead of a character or numeric variables. Therefore, the type of the data should be corrected before doing the analysis. We transformed the data type using mutate function from tidyverse Wickham et al. (2019) and as.factor function from R built-in base package (R Core Team 2020). Figure 2.3 portrays that the variables have been in a correct type.

Data Type Visualization

Figure 2.3: Data Type Visualization

We also created some variables by transforming or wrangling the original variable to analyze the data. Those variables are listed as follows:

  1. is_canceled_new. This variable is actually the same with is_canceled. We only recoded the value from 0 and 1 to be “not canceled” and “canceled” in order to make it easier to interpret.
  2. lengt_of_stay is the number of days that the guests spend in the hotel. It is the summation of stays_in_weekend_nights and stays_in_week_nights variable.
  3. stay_on is a categorical variable to observe whether the guests stayed in weekend, weekday, or both.
  4. number_of_guest is a summation of adults, imputed_children, and babies variables.
  5. kids is a summation of imputed_children and babies variables.
  6. family_type is a categorical variable to observe whether the guest stayed with kids or not.

The lengt_of_stay and stay_on would be used to analyze the staying pattern on the guests. Whilst number_of_guest, kids, and family_type would be used to analyze the type of guest who stayed at the hotel. Moreover, the dataset only provides the country of bookings with country codes coded in ISO code. Hence, we transform these code to the country name using countrycode package (Arel-Bundock, Enevoldsen, and Yetman 2018).



3 Analysis and findings

We started with 31 variables to look at, and really without any distinct direction, so we started off by looking at some of the most prominent features of hotels.

3.1 Seasonality:

Tourism industry cares a lot about seasonality as it would affect the guest flows to tourist locations. Hotel season divided into two main seasons: the high and low season. As the name implies, high season is a busy season when the weather is favourable and the guests inflows are high; low season vice versa. Portugal is of no exception. The high season in Portugal usually runs in summer (June to September) and spring time (January to March) ; the beaches are usually the busiest in July and August. The low season usually falls in the winter season which starts around November and ends by the end of February. Weather during this period may feature humidity, unexpected rain and strong, cold breeze which is not too ideal for sightseeing.

Keeping the season time in mind, we are interested in finding out any possible influence seasonality may have on the guest count and adr of our hotels in study. The analysis in this section took into account only “Check-out” or successful bookings to avoid inflating numbers.

Guest Count

First, we tried to plot the guest count by month for each year to check if the seasonality is consistent through years (it should be unless some one - off events took place and disturbed the market). Problems arised when we wanted to look at the yearly subset of our records and as our dataset ran from 1 July 2015 to 31 August 2017, we could not group the data by their conventional year of 2015, 2016 and 2017. We decided to break down the dataset into three groups by allocating 12 months into each and called them year 1,2 and 3. As Year 3 only had 2 months, we dropped that year and focused only on two Year 1 (July 2015 - June 2016) and Year 2 (July 2016 to June 2017). The result is presented in figure 3.1 below.

Number of guests by month

Figure 3.1: Number of guests by month

The months are presented in the order of occurence (July was the first year for each year) to ensure the choronological order of the dataset. The overall trend was roughly the same for both years and both hotels. The seasonality trend resembled the “W” shape with the lower points of W fell into the winter months from November to January and the peak points were in spring and summer time. Interestingly, both hotels saw the higest of guests in Spring time in Year 1 (May and March) whilst in year 2, the highest was observed in the summer time (August and October).

There were some anormalities, however, spotted, of which the most noticable was that City hotel suffered a serious hit in guest counts in July 2015 whilst Resort saw no such event. This drop could be due to some hotel - specific reasons, for example the hotel might have closed down for some time for renovation; however we were unable to answer due to the lack of information of hotel’s identity. Other abnormal notice could be a reverse trend in March of Resort. Again this abnormality could not be explained and we believed it was due to hotel - specific cause rather than industry - specific.

Next we look at the Average Daily Rate or ADR in short.

Figure 3.2: ADR by month

Unlike figure 3.1, figure 3.2 showed a striking difference between the two hotels. Resort had the highest rates in the summer time, which make sense as Algrave is a beach area. Both hotels recorded the lowest rates in winter, however City hotel observed less fluctuation than resort.

We noticed that in the dataset, some of the bookings recorded a very low ADR (less than 10 euros), some may even down to 1 Euro for non - complimentary rooms in both hotels. This may seem a little strange and could created an impact on the ADR we plotted above in figure 3.2. However, we chose not to exclude those adr as we did not know the pricing policy of the hotels in question.

3.2 Exploring the Hotels’ Market Segment

According to Meier (2017), a relevant market segment is one of the key to be successful in hotel management, especially to set the price correctly. Therefore, it is essential to define the distribution of these hotels market segment.

Figure 3.3 shows that the vast majority of bookings in both hotel were made through Online Travel Agent (OTA). In addition, the portion of bookings through OTA was bigger in the city hotel than in the resort hotel. In contrast with the city hotel, we could see more portion of direct bookings in the resort hotel than group bookings.

A study from Phocuswright in pegs.com (2016) stated that one of the major reason of why do the people book trough OTA is beacuse the website is easy to use. Meanwhilst, Howe (2017) argued that OTAs are favorable because its easy booking option. According to Jedina and Ranjinib (2017) in Talwar et al. (2020), the factors for booking the hotel through OTA are the accessibility, pricing, review accountability, and the customer services.

Figure 3.3: The distribution of hotel market segment in the city and resort hotel (2015-2017)

To look further how the OTA has became the main player of these hotels market segment, we plotted the booking against the time, in this case is semester-wise. We chose this time frame because we have the full semester-wise data each year until the first semester of 2017. Hence, the comparison would be apple to apple.

Figure 3.4: The distribution of hotel market segment in the city and resort hotel by semester

Figure 3.4 shows that the OTA has taken over the domination of group bookings in the city hotel only in a half of year period. The proportion of the OTA in the first semester of 2016 has been doubled than the previous semester. Whilst in the resort hotel, the bookings trough OTA has been dominating since the beginning of the period observed. In contrast to the city hotel, in the first semester of 2016, the bookings through OTA slightly decreased and the group booking increased. We could also see that the proportion OTA bookings reached the peak in semester 2 of 2016 in both hotels.

Aside from OTA matter, Figure 3.4 shows another interesting fact that in the condition of OTA booking was dominating the market segment, the proportion of direct booking in the resort hotel was relatively stable. However, this data set do not provide any information to investigate this matter deeper. Our guess is that may be this resort has their own way to promote their direct booking, for example may be through a loyalty voucher.

To gain more insight from the market segment, especially to examine how much the OTA bookings would have contributed to hotels’ revenue, we need to know the distribution the ADR from each market segment. From Figure 3.21, we could see that despite the high number of booking from the Online TA, the cancellation rate from this market segment was also relatively high. Hence, we should examine the distribution of hotel guests from the actual bookings only, which is the guests who did not cancel the booking. It is because only this kind of booking that would bring the actual revenue for the hotels.

Figure 3.5: The distribution of hotels’ actual guest by market segment

Figure 3.5 conveys that most of actual guest in both hotels made the bookings through OTA, eventhough the cancellation rate from this market segment was also high. The reason might be because the portion of OTA booking was already big and the other categories also made cancellations.

Since complementary, aviation, and corporate categories share fewer proportion of bookings compared to the others, we would recode these categories into other category.

In the dataset, we do not have any information about the ADR by the room type. In fact, this variable would determine the booking rate. Therefore, in order to get unbiased insight, we would analyze the ADR by the room type.

We could not also also find any information of the currency used in the ADR variable. Thus, we assumed that the ADR is stated in Euro since these hotels are located in Portugal. In this variable, we also found 0 value (2.32 percent of the data), but we did not find any further information about it. We assumed that it could be the complementary room that the hotels provide for free. We would not include these observations in the analysis because it would not contribute to the hotels’ revenue.

Figure 3.6 shows that the distribution of ADR based on the market segment across room type was heterogeneous. We could also found many outlier observations. The median of ADR shown in Figure 3.6 also indicates a price competition between direct and OTA booking in both hotels. This finding is somehow in line with the current “war” between direct on OTA booking in the hotel industry. According loxleyhotels (2019) and Ting (2019) to as the booking transaction through OTA increased, so did the commission rate per booking that the hotels have to pay. The hotel then fight back to urge direct booking through the pricing (Bosworth 2017).

From this finding, we could also learn that from the customer perspective, it is might be better to book the hotel through group booking or offline travel agent because they were cheaper.

Figure 3.6: The distibution of ADR by assigned room type with BB meal in the city and resort hotel

From this exploration, we could tell that the online travel agent has dominated the market segment both in the city and resort hotels. In the city hotel, the domination began in first semester of 2016, whilst in the resort, it has began from the beginning of the period observed. In addition, the median of ADR based on market segment indicates an evidence of competition between direct and OTA bookings.

3.3 Where Did The Bookings Come From?


Another way to gain more insight about the market is to examine the origin of the travelers that book the hotel room. It is important to understand their behavior and preference. Hence, the hoteliers are able to develop strategies to attract them.

Firstly, we would like to observe the distribution of the country of the origin of the bookings in both hotel. This dataset indeed does not have any missing value, but it has a NULL value, which means that we can not gain any insight from these observations. Thus, before doing the analysis, we check for the NULL value in the country variable first. We found that 0.409 percent of observations have a NULL value in their country variable. In this case we omitted these observations because the number of observations is large enough and the proportion of NULL cases is relatively small (Kang 2013).

Figure 3.7 provided the map of the bookings origin country to get a view of the distribution of the bookings.

Figure 3.7: The distibution of travelers origin who books the hotels in 2015-2017

If we hover to the map, we found that the bookings came from most of the country across the world. In order to get a detail view, we provided the table of the ‘bookings’ country of origin. The table suggests that the bookings came from 177 different countries.



By country-wise, most bookings were from Portugal. However the number of bookings made by international travelers are also a lot, especially from the European countries. In the further analysis, we would categorize the bookings into two categories, namely “local” and “international”.

Figure 3.8: Bookings in city and resort hotel by country of origin


Figure 3.8 portrays that the international booking was dominating in both hotels. Somehow, the portion pattern is quite similar to the pattern of the bookings that made by OTA in the Figure 3.3. Therefore it is interesting to examine the market segment of the international bookings.

Figure 3.9: Distribution of international travelers by market segment


According to Figure 3.9, vast majority of international bookings were made through OTA in both hotels. This fact, once again, proved the OTA dominance in the hospitality industry. One interesting fact is that in the resort, many international travelers also booked through offline TA/TO.

We were curious whether the international travelers pay more than the local travelers. It is actually interesting question since most of bookings came from abroad. To answer this question we would only, again, using the data from actual bookings. In addition, to avoid bias in the interpretation, we would only use the observations from the same room type. According to Figure 3.10, we would use the observations from room type “A” due to the high shares of these categories in both hotels.

Figure 3.10: Meal and moom type distribution of hotels’ guests

Another fact presented in Figure 3.10 is that the room type booked were more varied in the resort hotel, whilst in the city hotel, the room choice polarized to room type A and D. This finding could be a fruitful insight that the city hotel could improve their service to these best seller rooms.

Figure 3.11: Distribution of ADR by guest origin in the city and resort hotel

In Figure 3.8, we found the fact that most bookings in both hotels are from international travelers and the ADR median presented in Figure 3.11 shows that they paid more money than the locals in both types of hotel. Hence, we could argue that these travelers has became the valuable customers to the hotels.

Since the foreign travelers has became the valuable guest to the hotel, it is better to maintain them as future long-term customer through personalization. According to Criton (2019) personalisation is the key to winning the heart of the customer. Therefore, the hotels should have a guest profile which enable them to make guest personalisation. SiteMinder.com (n.d.) mentioned that the guest profile including but are not limited to preferred length of stay and activities engaged in. Thus, using the dataset, we would try to get the information of the guest profile, in this case, the international guest.

Weekend or Weekday? Why Not Both!

During the exploration, we found an interesting case in which the guests did not spend any night in the hotel neither they canceled the bookings and no show. It means that the guest checked out in the same day they checked in. There are 675 actual bookings where the guests did a same day check out and we found that 14.07 percent are international bookings.

We observed further by checking the distribution of international guest who were really spend the nights at the hotels. We found in Figure 3.12, that most guest in the city hotel spent 3 days in the city hotel. In the resort hotel, most guests spent longer, which is 7 days (Figure 3.13). In addition, generally, the length of stay of the guests in the city hotel was relatively shorter than the guest in the resort hotel. In the resort, the guests length of stay was also more varied. These facts are not surprising because typically resort guest intentionally want to spend more days in the resort and enjoy their service. Meanwhile, the city hotel guests might be a regular travelers who wanted to explore Lisbon or did a Europe trip.

Figure 3.12: Length of stay of international guest in the city hotel

Figure 3.13: Length of stay of international guest in the resort hotel

Further, regarding the activity that the guest may engaged in, we would like to figure it out by looking at when is the time that the guest stayed at most and observing the type of guest.

Figure 3.14 shows that both in the city and resort hotel, the proportion of weekend stay bookings were the smallest. In the city hotel, most of international guests stayed in weekend and weekdays. It is also found in the resort hotel, but the portion was bigger, may be because most of guest stayed for 7 days.

Figure 3.14: The distribution of time the guest stayed at the hotels


Family Doesn’t Matter, Couple Does.

Now, we want to look at the type of international guest who stayed in both hotels. Here, we classified them by “family”, which is the guest that travel with kids, and “not a family”. Figure 3.15 portrays that the vast majority of guest in both city and resort hotel were not traveling with kids. The proportion in both hotels even exactly the same.

Figure 3.15: Type of International Guest in The City and Resort Hotel

We then look further to the actual number of the guests and found that most guest in the city and resort hotel came in couple. However, the percentage of couple guest in resort is higher in the resort than in the city hotel.

Figure 3.16: The distribution of the number of international guest in the city hotel

Figure 3.17: The distribution of the number of international guest in the resort hotel

From this exploration, we got an insight that most of booking is the two hotels are international bookings. Most of them booked the room through OTA and pay more high price then the locals in both hotels. When we looked deeper, in general, they spent a lot more time in the resort hotel, compared to the city hotel. To make a personalisation to optimize guest experience, both hotels can engage more activity for couple both in weekdays and weekend.

3.4 Reservation status

Another thing we can look at for the sake of comparison is the reservation status of these two hotels at figure 3.18 below:

Reservation status by hotel type

Figure 3.18: Reservation status by hotel type

Cancellation seems to be a problem in city hotel. Visually, the number of cancelation in city hotel was way higher than resort. However, since the total number of bookings in the two hotels were very different to begin with, we might want to check the proportion rather than the absolute number.

Table 3.1: Overall cancellation rate
hotel Canceled Check-Out No-Show total_bookings prop_cancel
City Hotel 32186 46228 916 79330 0.4057
Resort Hotel 10831 28938 291 40060 0.2704

In table 3.1, the overall cancelation rate of 3 years posed a striking difference between city hotel and resort: 40.57 percent vs 27.04 percent. What factor could possibly contribute to such difference? Did the problem come from the hotel itself or from whatever happened in the overall tourism industry at that point in time? To answer that question, we proceed to find out the common points in both hotel’s cancellation pool.

How do the cancellations have in common?

The Hotel’s PMS recored the lead time, which is the number of days elapse between the entering date of the booking into the system and the arrival date, the reservation status date and the days in waiting list, which is the number of days before the booking was confirmed with the customers. A high days in waiting list indicating that a booking takes longer to proceed and a high lead time indicating that the booking was made earlier prior to the arrival date.

System days of cancellation bookings

Figure 3.19: System days of cancellation bookings

Both hotels seems to proceed customers’ bookings with great speed with the median waiting days at around 0 days. However, city hotel had a lot of cancelled bookings that sat through a significantly high number of days in the system before they were confirmed to the customers. On other note, city hotel had a high range of lead time and days in the system, signalling that bookings which were made too early prior arrival time may have higher chance of being cancelled.

We will look futher at this assumption, but this time comparing check-out with cancelled bookings to verify the impact of lead time and days in waiting list on the cancellation rate.

Check-out vs Cancelled

We want to compare the days in waitng list and days in system for both hotels. Will a booking that takes too long to proceed (high days in waiting list) or is made too early before the arrival date (high lead time) more prone to cancellation?

System days of Check-out and Canceled

Figure 3.20: System days of Check-out and Canceled

At the first glance at figure 3.20, both types of bookings still maintained a median days of waiting as 0 days, cancelled bookings had a lower days in system compared to check-out, which was no surprises as cancellations usually are done days before the arrival date. However, lead time for cancelled bookings were much higher than of Check-out, except for the two outliners of check-out which exceed 650 days. This confirmed our assumption of lead time having an impact on the cancellation chance. The earlier the booking, the higher the chance of cancellation.

The one who got away

Having known which kind of the bookings are more likely to get cancelled, we are interested in knowing who were the one behind those bookings. We started by looking at the market segment of these cancellation:

Who are more likely to cancel?

Figure 3.21: Who are more likely to cancel?

We calculate the cancellation rate by taking the number of cancellation bookings divided by total bookings for each market segment. From figure 3.21, we see that Groups had an impressive rate of cancellation at 68.44% for City hotel and 42.14% for Resort, followed by Travel agency/ tour operator (both offline and Online). City hotel noticably had the top 3 cancellation percentage at a much higher level than Resort. Aviation or Air crew only showed up in City hotel, which actually made sense as airlines only need a place for their air crew to stay over and it does not have to be at a fancy resort.

Now that we know Groups were mostly the ones who defaulted the most, we hope to know the why they did what they did, i.e cancelled their bookings. Was this because of the deposit policy or their requests not being fullfilled? We moved on to look at deposit and other guest’s special requests in search for the reasons.

Deposit policy

We gathered all the cancelled bookings and divided them by their deposit type.

Deposit type

Figure 3.22: Deposit type

Figure 3.22 showed us something really interesting. If we looked at the absolute numbers of cancellation by counting the number of canceled bookings for each deposit type (left plot), we will have no deposit as the highest with non-refund followed by roughly a half. This came as no surprise as nowadays people were not expected to deposit for their stays, which made them more likely to cancel their bookings without spending a penny.
However, if we looked at the cancel rate by deposit type which we took the canceled bookings divided by all the bookings under each deposit type (right plot), we had a quite counter-intuitive discover. The non_refund now had the highest rate of cancellation, leaving the other two types far behind. This implied that people actually canceled the bookings that they had paid in advance! This insight called for further investigation, ideally with a different dataset of different hotels to verify its validity.

Other special requests

Finally, we wanted to know dig a little bit further into the bizarre finding we made earlier. What made people threw away good money by canceling the bookings they had paid before actually arrived at the premises? Could it be because per each deposit type, the hotels could not answer their requests (lacks parking spaces, unable to respond to special requests)? We will try to look at these elements for each deposit type to find out.

We will compute the percentage of repeated guests for all bookings, average number of previous cancellations, successful bookings, car parking space requests and special requests.

Table 3.2: Summary statistics
deposit_type bookings_count percent_repeated_guest previous_cancellations previous_successful_booking car_parking_space special_requests
No Deposit 104641 0.0357604 0.0420390 0.1549106 0.0711289 0.6514273
Non Refund 14587 0.0043875 0.4114623 0.0106259 0.0000686 0.0017824
Refundable 162 0.0246914 0.0000000 0.0185185 0.1234568 0.1419753

Table 3.2 added more to the confusion when non-refund bookings had customers who were either quite new to the hotels (low % of repeated guests) or had made a lot of cancellations before (highest previous cancellations, lowest successful booking), rarely required parking space (lowest average requests for car parking space) and rarely posed special requests (lowest average special requests) yet still willing to pay in advance only to cancel later. To put in plain words, these customers were either too pissed off in the past (high previous cancellations), too new to the hotels or not too picky but still decided to pay first and lost all the advance payment by abadoning the bookings. This point clearly asked for further investigation to decide whether this is hotel-specific or industry-specific.

4 Conclusion

In this study, we tried to explore the dataset in many angels. We started by comparing the two hotels featured in this study across all of the variables provided in the dataset.

We found out that the two hotels followed the overall seasonality trend in Portugal where high season fell in the spring and summer time. The ADR for two hotels were priced at a different rates with City hotel observed less fluctuation than Resort did. Also, by market segment, the ADR of the OTA and Direct booking channel appreared to be quite competitive even though the OTA’s prices were still a bit better.

By guests origin, around 40% - 44% of the bookings were from local tourists. A vast majority of international guests were delivered through OTA channel (50% - 68%) which again emphasized the influence of this channel over the hotel industry. We also learnt that it was the international guests who paid more for the services and the room choices.

Another thing we could learn of the guests’behaviour was that guests would spend more days in resort rather than city hotel. Our dataset pointed out that resort had a handful of guests staying for up to 7 days whilst in the city hotel was 3 days. There were 675 cases recorded where guests did not stay overnight and 14.07% of those were of international travellers, higlighting a difference that locations of the two hotels could make with the resorts being in the beach area of Algrave and the city hotels in Lisbon where guests might be a regular travelers who wanted to explore Lisbon or did a Europe trip. Interestingly, most guests stayed in our two hotels were couples with resort having received more couples than city hotel.

On a not so bright side, city hotels observed a 1.5 times higher in cancellation rate than resort. Most cancellations were common in that they steamed either from the bookings that were made too early in prior of the actual arrival date, or from bookings made by Groups and Travel agencies (both online and offline). We suspected that the loose cancellation policy of the OTA had promoted guests to abandon their bookings. Another counter - intuitive finding we discovered was that non - refund deposit group held the highest rate of cancellation; coupled with further looking into factors such as guests’special requests, we infered that customers who dropped out were either quite new to the hotels, had made a lot of cancellations before, rarely required parking space or rarely posed special requests yet still willing to pay in advance only to cancel later.

Throughout this study, we had encountered a lot of questions which can both be answered and not. We started with some initial generic questions and actually ended up with more detailed questions and an unexpected turn where we did not expect in the beginning (the cancellation digging was a nice surprise). We also had to make a lot of assumptions to facilitate our analysis which in turns, enriched our knowledge base as we had to scan through a myriad of studies looking for the answers. Overall, we enjoyed our little challenge with this dataset and hope that anyone who reads this excersise could be inspired too.

Reference

Antonio, Nuno, Ana de Almeida, and Luis Nunes. 2019. “Hotel Booking Demand Datasets.” Data in Brief 22. Elsevier Inc: 41–49.

Arel-Bundock, Vincent, Nils Enevoldsen, and CJ Yetman. 2018. “Countrycode: An R Package to Convert Country Names and Country Codes.” Journal of Open Source Software 3 (28): 848. https://doi.org/10.21105/joss.00848.

Bennett, Derrick A. 2001. “How Can I Deal with Missing Data in My Study?” Australian and New Zealand Journal of Public Health 25 (5). Oxford, UK: Blackwell Publishing Ltd: 464–69.

Bosworth, Patrick. 2017. “Hoteliers Must Pressure Otas in Competition for Direct Bookings.” https://www.phocuswire.com/duetto-partner-article-december-2017.

Criton. 2019. “The Importance of Personalisation in the Hospitality Industry.” https://www.criton.com/news-hub/the-importance-of-personalisation-in-the-hospitality-industry/.

Howe, Neil. 2017. “Hotels Versus Otas: Who Is Winning over Millenial Travelers?” https://www.forbes.com/sites/neilhowe/2017/07/31/hotels-versus-otas-who-is-winning-over-millennial-travelers/#27440fd5277a.

Jedina, Mohd Haniff, and Kohila Ranjinib. 2017. “Exploring the Key Factors of Hotel Online Booking Through Online Travel Agency.” In 4th International Conference on E-Commerce (Icoec) 2017 Held in Malaysia.

Kang, Hyun. 2013. “The Prevention and Handling of the Missing Data.” Korean Journal of Anesthesiology 64 (5): 402.

loxleyhotels. 2019. “Who’s Winning the Direct Bookings Wars Between Hotels and Otas?” https://loxleyhotels.com/index.php/2019/03/15/direct-booking-hotels-and-otas/.

Meier, Veit. 2017. “Market Segmentation - Know Where Your Hotel Demand Comes from.” https://www.bernerbecker.com/latest-articles/market-segmentation-know-hotel-demand-comes/.

pegs.com. 2016. “Why Do Travellers Prefer Booking with Otas?” https://www.pegs.com/blog/why-do-travelers-prefer-booking-with-otas/.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Talwar, Shalini, Amandeep Dhir, Puneet Kaur, and Matti Mäntymäki. 2020. “Why Do People Purchase from Online Travel Agencies (Otas)? A Consumption Values Perspective.” International Journal of Hospitality Management 88. Elsevier Ltd.

Tierney, Nicholas. 2017. “Visdat: Visualising Whole Data Frames.” JOSS 2 (16). Journal of Open Source Software: 355. https://doi.org/10.21105/joss.00355.

Tierney, Nicholas, Di Cook, Miles McBain, and Colin Fay. 2020. Naniar: Data Structures, Summaries, and Visualisations for Missing Data. https://CRAN.R-project.org/package=naniar.

Ting, Deanna. 2019. “Who’s Really Winning the Direct Booking Wars Between Hotels and Online Travel Agencies?”

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.